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Abstract 

In certain situations that shall be undoubtedly more and more common in 
the Big Data era, the datasets available are so massive that computing statis¬ 
tics over the full sample is hardly feasible, if not unfeasible. A natural ap¬ 
proach in this context consists in using survey schemes and substituting the 
’’full data” statistics with their counterparts based on the resulting random 
samples, of manageable size. It is the main purpose of this paper to in¬ 
vestigate the impact of survey sampling with unequal inclusion probabilities 
on stochastic gradient descent-based M-estimation methods in large-scale 
statistical and machine-learning problems. Precisely, we prove that, in pres¬ 
ence of some a priori information, one may signihcantly increase asymptotic 
accuracy when choosing appropriate hrst order inclusion probabilities, with¬ 
out affecting complexity. These striking results are described here by limit 
theorems and are also illustrated by numerical experiments. 

Keywords: statistical learning; survey schemes; sampling designs; 
stochastic gradient descent; Horvitz-Thompson estimation 


1. Introduction 

In many situations, data are not the sole information that can be exploited 
by statisticians. Sometimes, they can also make use of weights resulting from 
some survey sampling design. Such quantities correspond either to true in¬ 
clusion probabilities or else to calibrated or post-stratihcation weights, min- 
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imizing some discrepancy under certain margin constraints for the inclusion 
probabilities. Asymptotic analysis of Horvitz-Thompson estimators based on 
survey data (see [21]) has received a good deal of attention, in particular in 
the context of mean estimation and regression (see [23],|33|, [32], [20], [1] for 
instance). The last few years have witnessed signihcant progress towards a 
comprehensive functional limit theory for distribution function estimation, 
refer to [22], [H], [15], [13], [M] or [5]. In parallel, the held of machine¬ 
learning has been the subject of a spectacular development. Its practice 
has become increasingly popular in a wide range of helds thanks to various 
breakout algorithms {e.g. neural networks, SVM, boosting methods) and is 
supported by a sound probabilistic theory based on recent results in the study 
of empirical processes, see m, 1251. 1121. However, our increasing capacity 
to collect data, due to the ubiquity of sensors, has improved much faster than 
our ability to process and analyze Big Datasets, see m- The availability of 
massive information in the Big Data era, which machine-learning procedures 
could theoretically now rely on, has motivated the recent development of 
parallelized/distributed variants of certain statistical learning algorithms, see 
0.1221 , [2H] or [7] among others. It also strongly suggests to use survey tech¬ 
niques, as a remedy to the apparent intractability of learning from datasets of 
explosive size, in order to break the current computational barriers, see [16] . 
The present article explores the latter approach, following in the footsteps 
of [TB], where the advantages of specihc sampling plans compared to naive 
sub-sampling strategies are proved when the risk functional is estimated by 
generalized U -statistics. 

Our goal is here to show how to incorporate sampling schemes into it¬ 
erative statistical learning techniques based on stochastic gradient descent 
(SGD in abbreviated form, see [TU]) such as SVM, Neural Networks 
or SOFT A'-means for instance and establish (asymptotic) results, in order 
to guarantee their theoretical validity. The variant of the SGD method we 
propose involves a specihc estimator of the gradient, which shall be referred 
to as the Horvitz-Thompson gradient estimator (HTGD estimator in abbre¬ 
viated form) throughout the paper and accounts for the sampling design by 
means of which the data sample has been selected at each iteration. For 
the estimator thus produced, consistency and asymptotic normality results 
describing its statistical performance are established under adequate assump¬ 
tions on the hrst and second order inclusion probabilities. They reveal that 
accuracy may signihcantly increase {i.e. the asymptotic variance may be 
drastically reduced) when the inclusion probabilities of the survey design are 
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picked adequately, depending on some supposedly available extra informa¬ 
tion, compared to a naive implementation with equal inclusion probabilities. 
This is thoroughly discussed in the particular case of the Poisson survey 
scheme. Although it is one of the simplest sampling designs, many more 
general survey schemes may be expressed as Poisson schemes conditioned 
upon specihc events. We point out that statistical learning based on non 

1.1. d. data has been investigated in [35] (see also [1] for analogous results in 
the on-line framework). However, the framework considered by these authors 
relies on mixing assumptions, guaranteeing the weak dependency of the data 
sequences analyzed, and is thus quite different from that developed in the 
present article. We point out that a very preliminary version of this work 
has been presented at the 2014 IEEE International Conference on Big Data. 

The rest of the paper is structured as follows. Basics in M-estimation 
and SGD techniques together with key notions in survey sampling theory are 
briefly recalled in section Section Erst describes the Horvitz-Thompson 
variant of the SGD in the context of a general M-estimation problem. In 
section limit results are established in a general framework, revealing the 
possible signihcant gain in terms of asymptotic variance resulting from sam¬ 
pling with unequal probabilities in presence of extra information. They are 
next discussed in more depth in the specihc case of Poisson surveys. Illustra¬ 
tive numerical experiments, consisting in htting a logistic regression model 
(respectively, a semi-parametric shift model) with extra information, are dis¬ 
played in section!^ Technical proofs are postponed to the Appendix section, 
together with a rate bound analysis of the HTGD algorithm. 

2. Theoretical background 

As a hrst go, we start off with describing the mathematical setup and 
recalling key concepts in survey theory involved in the subsequent analysis. 
Here and throughout, the indicator function of any event B is denoted by 
I{H}, the Dirac mass at any point a by 5a and the power set of any set E 
by V{E). The euclidean norm of any vector x G d > 1, is denoted by 
||a:|| = (X^f=i The transpose of a matrix A is denoted by , the 

square root of any symmetric semi-dehnite positive matrix B by 

2.1. Iterative M-estimation and SGD methods 

Let 0 C with g > 1 be some parameter space and -0 : x 0 —M be 

some smooth loss function. Let Z be a random variable taking its values in 
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such that 'ipiZ, 6) is square integrable for any 0 G 0. Set L{6) = E['0(2', 6)] 
for all 0 G 0. Consider the risk minimization problem 

minL( 6 '). 

eee 

Based on independent copies Zi, ..., Zjy of the r.v. Z, the empirical version 
of the risk function is 6 ^ G 0 e-)■ L 7 v( 6 '), where 

1 ^ 
i=l 

for all 6* G 0. As iV — )■ +C) 0 , asymptotic properties of M-estimators, i.e. 
minimizers of Ljv(0), have been extensively investigated, see Chapter 5 in 
[36] for instance. Here and throughout, we respectively denote by and 
Vg the gradient and Hessian operators w.r.t. 6. By convention, Vg denotes 
the identity operator and gradient values are represented as column vectors. 


Gradient descent. Concerning computational issues (see 0). many practi¬ 
cal machine-learning algorithms implement variants of the standard gradient 
descent method, following the iterations: 

e{t + i) = e{t)--f{t)VeLN{e{t)), (i) 

with an initial value 0(0) arbitrarily chosen and a learning rate (step size 
or gain) 'y(t) > 0 such that XlS 7(^) = < -foo. Here 

we place ourselves in a large-scale setting, where the sample size N of the 
training dataset is so large that computing the gradient of L^v 

1 ^ 

( 2 ) 

i=l 

at each iteration ([^ is too demanding regarding available memory capacity. 
Beyond parallel and distributed implementation strategies (see 0 ) , a nat¬ 
ural approach consists in replacing ([^ by a counterpart computed from a 
subsample S' C {1, .. ., Nj of reduced size n « N, hxed in advance so 
as to fulhll the computational constraints, and drawn at random (uniformly) 
among all possible subsets of same size: 

ln(9) = -y2VeHZi,9). ( 3 ) 

^ ieS 
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The convergence properties of such a stochastic gradient descent, usually 
referred to as mini-batch SGD have received a good deal of attention, in 
particular in the case n = 1, suited to the on-line situation where training 
data are progressively available. Results, mainly based on stochastic ap¬ 
proximation combined with convex minimization theory, under appropriate 
assumptions on the decay of the step size 7 (t) are well-documented in the 
statistical learning literature. References are much too numerous to be listed 
exhaustively, see |26] for instance. 

Example 1. (Binary classification) We place ourselves in the usual 
binary classification framework, where Y is a binary random output, taking 
its values in {—1, -1-1} say, and X is an input random vector valued in a 
high-dimensional space X, modeling some (hopefully) useful observation for 
predicting Y. Based on training data {(Xi,!^), ..., {Xj^,Yn)}, the goal is 
to build a prediction rule sign(h(X)), where h : X —)■ M is some measurable 
function, which minimizes the risk 

L.,(h) = E[^(-Yh(X))\, 

where expectation is taken over the unknown distribution of the pair of r.v. ’s 
{X,Y) and : M —)■ [0, -|-cxd) denotes a cost function (i.e. a measurable 
function such that ip{u) > I{u > 0} for any u eM.). For simplicity, consider 
the case where decision function candidates h[x) are assumed to belong to the 
parametric set of sguare integrable (with respect to X’s distribution) functions 
indexed by Q G M.'^, q > I, {h{., 9), 0 G 0} and the convex cost function is 
(p{u) = {u + 1)^/2. Notice that, in this case, the optimal decision function 
is given by: \/x G X, h*{x) = 2P{y = -|-1 | X = a:} — 1. The classification 
rule H*{x) = sign{h*{x)) thus coincides with the naive Bayes classifier. We 
abusively set L^{6) = L^{h{., 6)) for all 6* G 0. Consider the problem of 
finding a classification rule with minimum risk, i.e. the optimization problem 
L^{6). In the ideal case where a standard gradient descent could be 
applied, one would iteratively generate a sequence 6{t) = {6i{t),--- ,9d{t)), 
t > 1, satisfying the following update equation: 

e{t + 1) = e{t) + 7(t) E [YXgh{X, e{t))g:\-YH{X, 0(t)))], 

where 'y{t) > 0 is the learning rate. Naturally, as (X, X)’s distribution is 
unknown, the expectation involved in the t-th iteration cannot be computed 
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and must be replaced by a statistical version, 

N 

{1/N) J2{y^eh{X„ e{t))ip'{-Y,H{X,, e{t)))} 

i=l 

in accordance with the Empirical Risk Minimization paradigm. This is a 
particular case of the problem previously described, where Z = iX, Y) and 
ij{z,e) = if{-Yh{x,e)). 

Example 2. (Logistic regression) Consider the same probabilistic model 
as above, except that the goal pursued is to find 9 E Q so as to minimize 

/ Lti log f + inL log (_^_ 

which is nothing else than the opposite of the conditional log-likelihood given 
the Xi ’s related to the parametric logistic regression model: 9 E Q, 

= +1 I X} = exp{h{X, 9))/{l + exp{h{X, 9))). 

2.2. Survey sampling 

Let P) be a probability space and N > 1. In the framework we 

consider throughout the article, it is assumed that Zi, ..., Z^ is a sample of 
i.i.d. random variables defined on P), taking their values in W^. The 

Zfs correspond to independent copies of a generic r. v. Z observed on a hnite 
population Wat := {1, • • •, N}. We call a survey sample of (possibly random) 
size n < iV of the population Un, any subset s := {ii,... ,in{s)} ^ ViplN) 
with cardinality n =: n{s) less that N. Given the statistical population Wat, 
a sampling scheme (design/plan) without replacement is determined by a 
probability distribution on the set of all possible samples s E V{Ujq). 
For any f G {1, ..., N}, the (first order) inclusion probability, 

'^fiRn) ■= ^ “S'}, 

is the probability that the unit i belongs to a random sample S drawn from 
distribution R^. We set 7z{Rn) := (nfiRN), ■ ■ ■ ,nN{RN))- The second order 
inclusion probabilities are denoted by 

Mj{RN) ■= PRiv{(b j) e S^}, 
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for any {i,j) in {1,, iV}^. Eqnipped with these notation, we have = vTj 
for 1 < i < N. When no confusion is possible, we shall omit to mention the 
dependence in Rn when writing the hrst/second order probabilities of inclu¬ 
sion. The information related to the resulting random sample S C {1,..., 
is fully enclosed in the r.v. := (ci,..., Cat), where Cj = I{z G S}. Given 
the statistical population, the conditional 1-d marginal distributions of the 
sampling scheme are the Bernoulli distributions = 7ii6i + (1 — vrj)(5o, 

1 < i < iV, and the conditional covariance matrix of the r.v. e^v is given 
by rAT := {vTij — j<]^- Observe that, equipped with the notations 

above, = n{S). 

One of the simplest survey plans is the Poisson scheme (without replace¬ 
ment). For such a plan Tat, conditioned upon the statistical population of 
interest, the e^s are independent Bernoulli random variables with parameters 
Pi, ..., pAf in (0,1). The hrst order inclusion probabilities thus character¬ 
ize fully such a plan: equipped with the notations above, 7rj(T/v) = Pi for 
i G {1, ..., N}. Observe in addition that the size n{S) of a sample gener¬ 
ated this way is random with mean Pi nnd goes to inhnity as N ^ -|-cx) 
with probability one, provided that mini<j<ArPi remains bounded away from 
zero. In addition to its simplicity (regarding the procedure to select a sam¬ 
ple thus distributed), it plays a crucial role in sampling theory, insofar as 
it can be used to build a wide range of survey plans by conditioning ar¬ 
guments, see |23]. For instance, a rejective sampling plan of size n < N 
corresponds to the distribution of a Poisson scheme conditioned upon the 
event = n}. One may refer to [IT], [12] for accounts of survey sam¬ 

pling techniques and examples of designs to which the subsequent analysis 
applies. 


2.3. The Horvitz-Thompson estimator 

Suppose that independent r.v.’s Qi, ..., Qat, copies of a generic vari¬ 
able Q taking its values in are observed on the population A 

natural approach to estimate the total Qat = Qi based on a sample 
S' C {1, ..., N} generated from a survey design Rjq with (hrst order) in¬ 
clusion probabilities {7rj}i<j<Ar consists in computing the Horvitz-Thompson 
estimator (HT estimator in abbreviated form) 


i&S 


1 

vr,: 


N 


QS = E -«• = E -«■. 


i=l 


VTi 


( 4 ) 
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with 0/0 = 0 by convention. Notice that, given the whole statistical popu¬ 
lation Qi, ..., Qn, the HT estimator is an unbiased estimate of the total: 

I Qi) • • •) Qn] = Qn almost-surely. When samples drawn from Rjsf 
are of hxed size, the conditional variance is given by: 


var 


i<j 


{QrI \Qu • • ■, Qn) ~ Ki - 


Qj 


(5) 


When the survey design is a Poisson plan T/v with probabilities pi, ..., pn, 
it is given by: 


var (Q^J I Qi, 


N 


Qn) - 


Pi 


2 = 1 


Pi 


WQi 


( 6 ) 


Remark 1. (Auxiliary information) In practice, the first order inclu¬ 
sion probabilities are defined as a function of an auxiliary variable, W taking 
its values in say, which is observed on the entire population (e.g. a d'- 
dimensional marginal vector Z' for instance): for all i & {1,..., N} we can 
write Hi = niWi) for some link function vr : —)■ (0,1). When W and Q 

are strongly correlated, proceeding this way may help us select more infor¬ 
mative samples and conseguently yield estimators with reduced variance. A 
more detailed discussion on the use of auxiliary information in the present 


context can be found in subsection f.l 


Going back to the SGD problem, the Horvitz-Thompson estimator oi the 
gradient iN^d) based on a survey sample S drawn from a design with 
(hrst order) inclusion probabilities {7rj}i<j<Ar is: 


As pointed out in Remark ideally, the quantity TTj should be strongly 
correlated with V Hence, this leads to consider a procedure where 
the survey design used to estimate the gradient may change at each step, 
as in the HTGD algorithm described in the next section. For instance, 
one could stipulate the availability of extra information taking the form of 
random helds on a space W, {hFi(6')}ege with 1 < ^ < and assume the 
existence of a link function tt : W —)■ (0,1) such that vr* = 7i{Wi{9)). Of 
course, such an approach is of beneht only when the cost of the computation 




of the weight 7i(Wi{9)) is smaller than that of the gradient Ve'ipiZi^O). As 
shall be seen in section this happens to be the case in many sitnations 
enconntered in practice. 

3. Horvitz-Thompson gradient descent 

This section presents, in fnll generality, the variant of the SGD method 
we promote in this article. It can be implemented in particnlar when some 
extra information abont the target (the gradient vector held in the present 
case) is available, allowing hopefnlly for picking a sample yielding a more 
accnrate estimation of the (trne) gradient than that obtained by means of 
a sample chosen completely at random. Several timing parameters mnst be 
picked by the user, including the parameter Nq which controls the number 
of terms involved in the empirical gradient estimation at each iteration, see 
Fig. [1} 

The asymptotic accuracy of the estimator or decision rule produced by 
the algorithm above as T —)■ +cx) is investigated in the next section under 
specihc assumptions. 

Remark 2. (Balance between accuracy and computational cost) 
We point out that the complexity of any Poisson sampling algorithm is 0{N), 
just as in the usual case where data involved in the standard SGD are uni¬ 
formly drawn with(out) replacement. However, even if it can he straightfor¬ 
wardly parallelized, the numerical computation of the inclusion probabilities 
at each step naturally induces a certain amount of additional latency. Hence, 
although HTGD may largely outperform SGD for a fixed number of iterations, 
this should be taken into consideration for optimizing computation time. 
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Horvitz-Thompson Gradient Descent Algorithm (HTGD) 


(Input.) Datasets {Zi, Zj^} and {Wi, Wn}. Maximum 
(expected) sample size Nq < N. Collection of sampling plans i?Ar(0) 
with first order inclusion probabilities 7rj(0) for 1 < i < N, indexed by 
6 £ Q with (expected) sample sizes less than Nq. Learning rate 7 (t) > 0. 
Number of iterations T > 1. 

1. (Initialization.) Choose 9{0) in 0. 


2. (Iterations.) For t = 0, ..., T 

(a) Draw a survey sample S = St, described by = (e^*\ ..., e^^) 
according to RN{0{t)) with inclusion probabilities 'Ki{9{t)) for i = 

1, ..., A. 

(b) Compute the HT gradient estimate at 9{t) 




N 






(c) Update the estimator 

9{t + 1) = 9{t) - -f{t) l^^{9{t)). 


(Output.) The HTGD estimator 9{T). 


Figure 1: The generic HTGD algorithm 


4. Main results 

This section is dedicated to analyze the performance of the HTGD 
method under adequate constraints, related to the (expected) size of the 
survey samples considered. We first focus on Poisson survey schemes and 
next discuss how to establish results in a general framework. 
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4-1. Poisson schemes 

Fix 9 E Q and ^ (0,iV). Given the sample Zi, consider 

a Poisson scheme with parameter p = (pi, ..., p^)- In this case, Eq. ([^ 
becomes: 


E 


JHT 


(9)-I 


N 


Zi, 


-‘N 


1 


N 

E 

2=1 


Pi, 


Pi 




Searching for the parameters pi, ..., pn snch that the L 2 distance between 
the empirical gradient evalnated at 9 and the HT version given Zi, ..., 
is minimum under the constraint that the expected sample size is equal to 
Nq G [0, N] yields the optimization problem: 


N 

min > 

pe(o,i)^ “ 
2 = 1 


1 -Pi 
Pi 




N 

S.t. '^Pi = Nq. 
2=1 


( 8 ) 


As can be shown by means of the Lagrange multipliers method, the solution 
corresponds to weights being proportional to the values taken by the norm 
of the gradient: 

\\'^eipiZi,i 


P^{9) No 


2^7 = 1 


\Vgi>{z,, 


(9) 


However, selecting a sample distributed this way requires to know the full 
statistical population Ve'^{Zi,9). In practice, one may consider situations 
where the weights are defined by means of a link function n{W, 9) and auxil¬ 
iary variables Wi, ..., Wn correlated with the Zj’s, as suggested previously. 
Observe in addition that the goal pursued here is not to estimate the gra¬ 
dient but to implement a stochastic gradient descent involving an expected 
number of terms hxed in advance, while yielding results close to those that 
would be obtained by means of a gradient descent algorithm with mean held 
(1/iV) 9) based on the whole dataset. However, as shall be seen 

in the subsequent analysis, in general these two problems do not share the 
same solution from the angle embraced in this article. 

In the next subsection, assumptions on the survey design under which the 
HTGD method yields accurate asymptotic results, surpassing those obtained 
with equal inclusion probabilities (he. pi = A"o/A^ for alH G {1, ..., N}), 
are exhibited. 
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4-2. Limit theorems 

We now consider a collection of general {i.e. not necessarily Poisson) 
sampling designs {-RAr(6')}6iee and investigate the limit properties of the M- 
estimator produced by the HTGD algorithm conditioned upon the data 
Vn = {Zi, ..., Zjv} (or Pat = {(Zi, Wi), ..., (Wjv, Z^)} in presence of 
extra variables, cf Remark [^. The asymptotic analysis involves the reg¬ 
ularity conditions listed below, which are classically required in stochastic 
approximation. 


Assumption 1. The conditions below hold true. 

• For any z, 9 ilj{z, 6) is of class . 

• For any compact set /C C we have with probability one: Wi G 

N}, 


sup 

eeK. 


V.^(z„0)|| 


vri(6') 


< +CXD. 


• The set of stationary points Cn = {9 : VgLN{9) = 0} is of finite 
cardinality. 


Theorem 1. (Consistency) Assume that the learning rate decays to 0 so 
that X]t>i7(^) = Si>o7^(^) ^ + 00 . Suppose also that the HTGD 

algorithm is stable, i.e. there exists a compact set /C C s.t. 9{t) G /C 
for all t > 0. Under Assumption conditioned upon the data Fjsi, the 
seguence {9{t)}t>o converges to an element of the set Cn with probability 
one, as t ^ +cxd. 


The stability condition is generally difficult to check. In practice, one may 
guarantee it by conhning the sequence to a compact set hxed in advance and 
using a projected version of the algorithm above. For simplicity, the present 
study is restricted to the simplest framework for stochastic gradient descent 
and we refer to [26] or [9| (see section 5.4 therein) for further details. 

Consider 9* E C. The following local assumptions are also required to 
establish asymptotic normality results conditioned upon the event £{9*) = 
{linit^+oo ^(^) =9*}. 

Assumption 2. The conditions below hold true. 

• There exists a neighborhood V of 9* such that for all z, the mapping 
9 H-)■ fj^z, 9) is of class on V. 
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• The Hessian matrix H = V'^Lp^iO*) is a stable q x q positive-definite 
matrix: its smallest eigenvalue is I with / > 0 . 

• For all {i,j) G {1, ..., N}"^, the mapping 6 * G V h-)■ 7iij{9) is continu¬ 
ous. 

Theorem 2. (A conditional CLT) Suppose that Assumptions [I]{^ are 
fulfilled and that'y{t) = ■y{0)t~^ for some constants ■y{0) > 0 and a G (1/2,1]- 
When 0 = 1, take 7 ( 0 ) > 1 /( 2 /) and rj = l/( 27 ( 0 )). Set 7 = 0 otherwise. 
Given the observations Zi, ..., (respectively, (Zi, Wfi), ..., (Z^v, Wn)) 
and conditioned upon the event £{6*), we have the convergence in distribution 
as t ^ +CX) 

0/7(4) (9(i)-9-) ^A4(0,EJ, 

where the asymptotic covariance matrix is the unique solution of the Lya¬ 
punov equation: 

HT + TH + 2riT = T*, ( 10 ) 

with 


N 






S' ’"•I"*) 


1 


E 




( 11 ) 




The result stated below provides the asymptotic conditional distribution 
of the error. Its proof is a direct application of the second order delta method 
and is left to the reader. 


Corollary 1. (ERROR RATE) Under the hypotheses of Theorem given 
the observations Zi, ..., Zjsf (respectively, {Zi,Wi), ..., {Zisi,Wn)) and 
conditioned upon the event £{0*), we have the convergence in distribution 
towards a non-central chi-square distribution: 

1 / 7 ( 4 ) (£ 7 («( 4 )) - ^ twsl/ws'fu, 

as t ^ +CX), where U is a d-dimensional Gaussian centered r.v. with the 
identity as covariance matrix. 
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Before showing how the results above can be used to understand how 
specihc sampling designs may improve statistical analysis, a few comments 
are in order. 

Remark 3. (Asymptotic covariance estimation) An estimate ofTiT^ could he 
obtained by solving the equation SR+RS + 2?7S = r(6'(T)), replacing in 0 
the (unknown) target value 9* by the estimate produced by the HTGD algo¬ 
rithm after T iterations. Alternatively, a pereentile Bootstrap method could 
be also used for this purpose, repeating B > 1 times the HTGD algorithm 
based on replicates of the original sample Vjq. 

For completeness, a rate bound analysis of the HTGD algorithm is also 
provided in the Appendix section. 


4 . 3 . Asymptotic covariance optimization in the Poisson case 

Now that the limit behavior of the solution produced by the HTGD 
algorithm has been described for general collections of survey designs TZ = 
{Riv(6')}6)ee of fixed expected sample size, we turn to the problem of finding 
survey plans yielding estimates with minimum variability. Formulating this 
objective in a quantitative manner, this boils down to finding TZ so as to 
minimize ||Sy^||, for an appropriately chosen norm ||.|| on the space Adq(M) 
of g X g matrices with real entries for instance. In order to get a natural 
summary of the asymptotic variability, we consider here the Hilbert-Schmidt 
norm, i.e. ||A||hs = ^jTr{AA^) = for any A = (uij) e Md(R) 

where Tr(.) denotes the Trace operator. For simplicity’s sake, we focus on 
Poisson schemes and consider the case where p = 0. Let P = {p(6') = 
(pi(9), ..., pjv(9)}dee be a collection of first order inclusion probabilities. 
The following result exhibits an optimal collection of Poisson schemes among 
those with Nq as expected sizes, in the sense that it yields an HTGD estimator 
with an asymptotic covariance of square root with minimum Hilbert-Schmidt 
norm. We point out that it is generally different from that considered in 


subsection 4.1 revealing the difference between the issue of estimating the 
empirically gradient accurately by means of a Poisson Scheme and that of 
optimizing the HTGD procedure. 

Proposition 1. (Optimality) Let Q = . The collection p* of Pois¬ 

son designs defined by: Vi G {1, ..., N}, \/9 G 0, 


P:iO)=No: 
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is a solution of the minimization problem 

N 

min subject to Pi{0) = -^o for all 9 G Q, 

2=1 

where the infimum is taken over all collections p of Poisson designs. In 
addition, we have 

2||Sp-'|Ihs= 

i<j 

Of course, the optimal solution exhibited in the result stated above is 
completely useless from a practical perspective, since the matrix H is un¬ 
known in practice and the computation of the values taken by the gradient 
at each point Zi is precisely what we are trying to avoid in order to reduce 
the computational cost of the SGD procedure. In the next section, we show 
that choosing inclusion probabilities positively correlated with the p*{6)^s is 
actually sufficient to reduce asymptotic variability (compared to the situation 
where equal inclusion probabilities are used). In addition, as illustrated by 
the two easily generalizable examples described in section such a sampling 
strategy can be implemented in many situations. 

f-f- Extensions to more general Poisson survey designs 

In this subsection, we still consider Poisson schemes and the case rj = 0 for 
simplicity and now place ourselves in the situation where the information at 
disposal consists of a collection of i.i.d. random pairs (Zi, hPi), ..., {Zj^,Wn) 
valued in x We consider inclusion probabilities 


MO) = No 


p{Wi,0) 

E7.iP(1U.9) 


defined through a link function p : x 0 —)■ (0,1), see Remark The 

computational cost of p{Wi, 9) is assumed to be much smaller than that of 
Wef){Zi, 9) (see the examples in section]^ below) for all (i, 0) G {1, ..., N} x 
0. The assumption introduced below involves the empirical covariance cn{9) 
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between \ \QVe'ip{Z9)\\‘^/p{W,9) and p{W,9), for 6* G 0. Observe that it can 
be written as: 


i=l 

with 0 G 0. 



2=1 


\\QVei’{Z^,9)\\^ 

pm, 9) 


N 




Assumption 3. The link function p{w, 9) fulfills the following condition: 

m9*) > 0 . 

The result stated below reveals to which extent sampling with inclusion 
probabilities dehned by some appropriate link function may improve upon 
sampling with equal inclusion probabilities, pi = Nq/N for 1 < i < n, 
when implementing stochastic gradient descent. Namely, the accuracy of the 
HTGD gets closer and closer to the optimum, as the empirical covariance 
cn{9*) increases to its maximum. Notice that in the case where inclusion 
probabilities are all equal, we have c^v = 0. 

Proposition 2. Let Nq be fixed. Suppose that the collection of Poisson 
designs p with expected sizes Nq is defined by a link function p{w, 9) satisfying 
Assumption\^ Then, we have 

as well as 

0 < iisf^ii™ - iiSp'-'iiL = ^ Wuin - =»(«•)}. 

where o'%{9) denotes the empirical variance of the r.v. \\Ve'f>{Z,9)\\. 

As illustrated by the easily generalizable examples provided in the next 
section, one may generally hnd link functions fulhlling Assumption [^without 
great effort, permitting to gain in accuracy from the implementation of the 
HTGD algorithm. 


5. Illustrative numerical experiments 

For illustration purpose, this section shows how the results previously 
established apply to two problems by means of simulation experiments. For 
both examples, the performance of the HTGD algorithm is compared with 
that of a basic SGD strategy with the same (mean) sample size. 
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5.1. Linear logistic regression 

Consider the linear logistic regression model corresponding to Example 
with 6^ = (a, /9) G M X and h[x, 6) = a + fl'^x for all x G M.^. Let X' be a 
low dimensional marginal vector of the inpnt r.v. X, of dimension d! « d 
say, so that one may write X = {X',X'') as well as (3 = {(d', (d") in a similar 
manner. The problem of htting the parameter 9 throngh conditional MLE 
corresponds to the case 


'<P{{x,y),e) = - log 


ga+/3^x(^ + l)/2 + (1 _ y)/2 
1 + 


We propose to implement the HTGD with the link fnnction p{{x',y),6) = 
\\Xg^lj\{X,Y),e)\l where 


f{{x,y),9) 


-log 


' e -+d'^-\y + l)/2 + {l-y)/ 2' 
1 + 6 “+^'"^^') 


In order to illustrate the advantages of the HTGD technique for logistic 
regression, we considered the toy numerical model with parameters d = 11 
and 9 = (a, /3i,...,/?io) = (—9, 0, 3, —9,4, —9,15, 0, —7,1, 0), the 10 input 
variables being independent, uniformly distributed on (0,1). The maximum 
likelihood estimators of 9 were computed using the HTGD and SGD (mini¬ 
batch) . In order to compare them, the same number of iterations was chosen 
in each situation and a learning rate proportionnal to 1 /\/i was considered. 

As a hrst go, we drew a single sample of size N = 5000 on which the two 
algorithms were performed for 2000 iterations. Two sub-sample sizes were 
considered : n = 10 and n = 100. As can be seen on Fig. while virtually 
equivalent in terms of computation time, thus taking a larger sample improves 
the efficiency of the HTGD. It also appears to reach a better level of precision 
in less steps than both competitors, a phenomenon that is consistent on all 
11 coordinates of 9. 

So as to account for the randomness due to the data, we then simulated 
50 samples according to the model for two population sizes, N = 500 and 
N = 1000. For both the HTGD and the mini-batch SGD algorithms, a sub¬ 
sample size of 20 was chosen. As shown in Table [T| the HTGD seems to 
be more robust to data randomness than SGD and GD. It is not surprising, 
since the sampling phase selects the most informative observations relative 
to the gradient descent, which makes HTGD less sensitive to the possible 
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Figure 2: Evolution of the estimator of with the number of iterations in the HTGD 
(solid), mini-batch SGD (dotted) and GD (dashed) algorithms with n = 10 (left) and 
n = 100 (right) 


noise. It also provides more precise estimates, as suggested by the results in 
Table [H 



N = 500 

N = 1000 

HTGD 

1.52 

1.45 

SGD 

2.21 

2.09 


Table 1: Mean standard deviations of the final estimates of 0(= —9) across the 50 simu¬ 
lations 



Min. 

Median 

Max. 

Mean 

S.D. 

HTGD 

^5 

-9.5 

-8.7 

-7.8 

-8.6 

1.45 

Oe 

13.3 

14.6 

15.9 

14.5 

1.52 

SGD 

^5 

-9.9 

-8.2 

-7.4 

-8.2 

2.09 

Oe 

12.7 

13.9 

16.6 

15.2 

2.21 


Table 2: Statistics on the global behavior of the final estimates of ds and j3e across the 50 
simulations 
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Figure 3: 50 trajectories of the estimator of /^s with the number of iterations in the HTGD 
(solid), mini-batch SGD (dotted) over 50 populations (left) and of dg over 1 populations 
(right) 
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5.2. The symmetric model 

Consider now an i.i.d. sample (Xi, X 2 , ..., Xjv) drawn from an nn- 
known probability distribution on supposed to belong to the semi-parametric 
collection {Pej, 9 G 0}, 0 C dominated by some a-finite measure A. 
The related densities are denoted by f{x — 9), where 6* G 0 is a location 
parameter and a f{x) a (twice differentiable) density, symmetric about 0, 
i.e.f{x) = f{—x). The density / is unknown in practice and may be mul¬ 
timodal. For simplicity, we assume here that 0 C M but similar arguments 
can be developed when d > 1. For such a general semi-parametric model, it 
is well-known that neither the sample mean nor the median (if, for instance, 
the distribution does not weight the singleton {0}) are good candidates for 
estimating the location parameter 9. In the semiparametric literature this 
model is referred to as the symmetric model, see [H]. It is known that the 
tangent space {i.e. the set of scores) with respect to the parameter of inter¬ 
est 9 and that with respect to the nuisance parameter are orthogonal. The 
global tangent space at Pgj is given by 



where P 2 is the tangent space with respect to the nuisance parameter: 


P 2 = {h : Epgj.[/;,(X)] = 0, h{x) = h{—x) and Ep^^[h^(X)] < cx)} . 

Orthogonality simply results from the fact that f'{x) is an odd function and 
implies that the parameter 9 can be adaptively estimated, as if the density 
f{x) was known, refer to |H] for more details. In practice f{x) is estimated 
by means of some symmetrized kernel density estimator. Given a Parzen- 
Rosenblatt kernel K{x) {e.g. a Gaussian kernel) for instance, consider the 
estimate 



where /iat > 0 is the smoothing bandwidth, and form its symetrized version 
(which is an even function) 
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The related score is given by 


SN{x,e) = ■^fe,N{x)/fe,N{x). 

In order to perform maximum likelihood estimation approximately, one can 
try to implement a gradient descent method to get an efficient estimator of 
9. For instance, for a reasonable sample size N, it is possible to show that, 
starting for instance from the empirical median 6o with an adequate choice 
of the rate 7 ^, the sequence 

e{t) = 0 (t-l)+ 7 i— 0 (t-l)) 

converges to the true MLE. The complexity of this algorithm is typically 
of order 2T x N'^ if T > 1 is the number of iterations, due the tedious 
computations to evaluate the kernel density estimator (and its derivatives) 
at all points Xi — 6{t — 1). It is thus relevant in this case to try to reduce it 
by means of (Poisson) survey sampling. The iterations of such an algorithm 
would be then of the form 


9{t) = 9{t - 1) + 7t^ ^ ^SNiXj - 9{t - 1), 9{t - 1)), 

i=i 

N 

= ri. 
i=i 


As shown in section 4.3, the optimal choice would be to choose pj proportional 
to |sAr(Xj — 9(t — 1), 9(t — 1))| at the t-th iteration: 


. (gu _ j.'i _ N„\MXj-S{t-i), 


( 12 ) 


Unfortunately this is not possible because s is unknown and replacing s{x — 6) 
by 'sn{x — 9{t — 1), 9{t — 1)) in (12) yields obvious computational difficulties. 
For this reason, we suggest to use the (much simpler) Poisson weights: 


N 


i=i 
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Figure 4: Evolution of the estimator of the location parameter 0 = 0 of the balanced 
Gaussian mixture with the number of iterations in the HTGD (solid red), mini-batch 
SGD (dashed green) and GD (dotted blue) algorithms 


Fig. [^depicts the performance of the HTGD algorithm when 6 = 0 and 
f{x) is a balanced mixture of two Gaussian densities with means 4 and —4 
respectively and same variance = 1, compared to that of the usual SGD 
method. Based on a population sample of size N = 1000, the HTGD and 
SGD methods have been implemented with n = 10 and T = 3000 iterations, 
whereas 30 iterations have been made for the basic GD procedure (with 
n = N = 1000) so that the number of gradient computations is of the same 
order for each method. For each instance of the algorithms we took equal 
to the median of the population. 
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Figure 5: Evolution of the estimator of the location parameter 0 = 0 of the balanced 
Gaussian mixture with the number of iterations in the HTGD (solid blue) and mini-batch 
SGD (dashed red) algorithms over 50 populations 


Min. 

Median 

Max. 

Mean 

S.D. 

HTGD 

9 -0.35 

0.006 

0.29 

0.014 

0.16 

SGD 

9 -0.38 

-0.036 

0.42 

0.025 

0.22 

GD 

9 -0.52 

-0.162 

0.70 

0.20 

0.45 


Table 3: Statistics on the global behavior of the final estimates of the location parameter 
across the 50 simulations 


6. Conclusion 

Whereas massively parallelized/distributed approaches combined with 
random data splitting are now receiving much attention in the Big Data con¬ 
text, the present paper explores an alternative way of scaling up statistical 
learning methods, based on gradient descent techniques. It hopefully paves 
the way for incorporating efficiently survey techniques into machine-learning 
algorithms in order to exploit Big Data. Precisely, it shows how survey sam- 
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pling can be used in order to improve the accuracy of the stochastic gradient 
descent method for a hxed number of iterations, while preserving the com¬ 
plexity of the procedure. Beyond theoretical limit results, the approach we 
promote is illustrated by promising numerical experiments. 


Appendix A - Technical proofs 

Proof of Theorem 

Write the sequence as 

e{t -Fl) = 6{t) - -f{t)V eLN{0{t)) + 

where we set rjt+i = + VeL]sf{6{t)), so that —VeLN{0{t)) appears 

as the mean field of the algorithm and as the noise term. Consider the 
hltration F = where is the cr-held generated by Ci ..., et-i for 

f > 1 and Zi, ..., Ztv (respectively (Zi, Wi), ..., (Z^v, Wv) in presence of 
extra information). We have | = 0 for all t > 1, as well as: 


i=i 


+ 


E 




\Tri{e{t))Trj{e{t)) 


l]Ve'4^iZ„e{t)fVef{Z„eit)) 


< 


N 

E 

i=l 


sup 

9£IC 


|V,^(Z„0)|| 


'n-fO) 


sup ||V0^(Zi,6')|| 


ee/c 


N 


||V6lt/’(Zj, 

> sup-^ 

^ e&K 


< -f-oo. 


. ^=1 


The consistency result thus holds true under the stipulated assumptions, see 
Theorem 2 in [TH] or Theorem 2.2 in [2E] for instance. 


Proof of Theorem 

As observed in the preceding proof, {rit}t>i is a sequence of increments of a 
d-dimensional square integrable martingale adapted to the hltration if. The 
proof is a direct application of Theorem 1 in [HT] and consists in checking 
that the hypotheses of this result are fulhlled. Observe that the required 
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conditions for the mean field hold true. Considering next the noise sequence 
of the algorithm, notice hrst that supi>o IE[| |r7t+i| | G V} < +oo 

for any b > 2. Indeed, we have 


sup||77t+i||I{0(f) e V} < 

t>0 


2 ^ ||V,^(Z„0)|| 


In addition, we have \ 


r( 6 '(t)) for all f > 1, where: V 6 * G 0, 


N 




1 1 - TlijO) 

T^i{9) 




2=1 






Vei^{Zi,d)Vei^{Z,,df 


By virtue of the continuity assumptions, we can apply Lebesgue’s Dominated 
Convergence Theorem : as t —>■ +oo, r( 0 (t)) —)■ T* = r( 6 **) on the event 
£{6*). This concludes the proof. 


Proof of Proposition 

Observe that, in the case where rj 
be rewritten as 


0, the Lyapunov equation (10) can 


Sp + if-^Spi7 = 


We thus have: 


2||Sfll 


2 

HS 


Tr{H-'-r) = Tr(E[(H-H”'^{e‘)){l"'^(e‘)f] 


1 ^ 

n\\j,Y.PQV,i,(Zi,e 

i=l 


* \ 11 2] 


1 


Y. + 2r))^(Qv,V'(^i, e*)) 

.*=1 i<j 


The desired result can be then derived straightforwardly, by repeating 


the Lagrange multipliers argument used in subsection 4.1 
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Proof of Proposition 

Using the last equality in the previous proof and pi = Nq/N we have: 


2{||E 


1 / 2||2 
p Whs 


1 A N 




* \ 112 


2 = 1 


1 ^ELpW-.«* 




||<3vw>(E,9‘)lr = CA ,( r )/ iv „. 


This proves the first assertion. In addition, one has that 

2/V„{||Ey=||L-||Ey.t|L}=2A'o{l|S‘''ll™-||Sp''tlL+l|Si'tlL 

K N 

i=l *’ 2 


-1/2||2 
^p* \\HSi 


N 


N 


J2\m0i’iZi,9*)\\ =aU01-CN{9*), 


2=1 


which establishes the second assertion. 


Appendix B - Rate Bound Analysis 

Here we establish a rate bound for the HTGD algorithm under the as¬ 
sumption that the mapping 9 i—)■ 'iplz, 9 ) is convex, referred to as Assumption 
4 . Note that assumptions 2 . and 4 . implies that 9 * is unique and Ln is 
I strongly convex on V. For simplicity’s sake, we suppose that the strong 
convexity property holds true on M'’*. The following result relies on standard 
arguments in stochastic approximation, see m, 0 or [ 30 ] . 

Theorem 3. Under Assumptions 1 , 2 and f and for a stepsize 7(f) = 7(0)f“" 
with some constants 7(0) > 0 and a G (1/2,1] (when a = 1 , take 7(0) > 
1/(2/)/, there exists a constant < +00 such that: \/t > 1 , 

Elll«(i)-»‘lt]<(/. (13) 
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PROOF. We restrict ourselves to the case a = 1 and follow the proof of [2]. 
By construction, we have 

l|0(f+i)-rir = Il 0 (f)-rir- 27 (t)rr( 0 (t)f (0(t)-r)+|^^ 

Since 

we get 

E[| 0 (t + 1 ) - rn e{t)] = \\e{t) - rf - 2 -f{t)VF{e{t)f{e{t) - e*) 

The strong convexity property gives 

L^(9{t)) - L^{9*) < VL^{9{t)f(e{t) - 9*) - ^-\\9{t) - 9*f 

and 

L^{9*) -L^(9{t)) < - ^-\\9{t) - 9*\\^ 

so that 

l\\9it) - 9*f ^ VLN{9{t)f(9{t) - 9*). 

Combining this inequality with the previous one and taking the expectation, 
we obtain 

E[||9(t + 1) - 9-||2] < E|||9(i) - 9-f](l - 27(«)i) + 7(«)"E|l|fy(9(i))H"l- 

Under Assumption 1 , we have E[||/^^( 0 (t))|p] < D for some constant D > 0 . 
Using this bound and iterating the recursion, we hnally obtain 

i i t 

E|l|9(i+l)-9-f] « E|||9(l)-9-l|2]P[(l-2/7(j))+B^7(i)" If 

j=l j=l k=j+l 

with the convention nfc=t+i(l~2Z7(fc)) = 1 We now substitute the expression 
of 7(f) and, using the following classical inequalities 

1 + a; ^ 
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and 


log(t + 1 ) - log(j + 1 ) ^ 


we get 


E\\e{t +1) - e*\\^ ^ 


k=j+l 

(E\m-0r + D±-^) 

j=i j 


{t + 

where D is a positive constant. Since 7 ( 0 ) > 1 /( 2 /), we have 

' 1 ^ t2i7(0)-l 

^ j2-2Z7(0) ^ 2 / 7 ( 0 ) - 1 

and we hnally obtain the desired bound. 
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